Combining Light-Weight Retrieval Strategies for Robust Text Categorization
Abstract
We report on the development of a general-purpose text categorization system designed to automatically assign biomedical categories to any input text. Unlike typical automatic text categorization systems, which rely on data-intensive models extracted from large sets of training data, our categorizer is largely data-independent, so it can be used when training data are not available, provided that a small set of instances is available for tuning the system. As is usual with information retrieval engines, the tool provides a ranked list of categories, which the user can then filter interactively.
Similar resources
Feature Preparation in Text Categorization
Text categorization is an important application of machine learning to the field of document information retrieval. Most machine learning methods treat text documents as feature vectors. We report text categorization accuracy for different types of features and different types of feature weights. The comparison of these classifiers shows that stemmed or un-stemmed single words as features giv...
Integrating a Structured-Text Retrieval System with an Object-Oriented Database System
We describe the integration of a structured-text retrieval system (TextMachine) into an object-oriented database system (OpenODB). Our approach is a light-weight one, using the external function capability of the database system to encapsulate the text retrieval system as an external information source. Yet, we are able to provide a tight integration in the query language and processing; the us...
Learning-Free Text Categorization
In this paper, we report on the fusion of simple retrieval strategies with thesaural resources in order to perform large-scale text categorization tasks. Unlike most related systems, which rely on training data in order to infer text-to-concept relationships, our approach can be applied with any controlled vocabulary and does not use any training data. The first classification module uses a tra...
Combining image content and annotated text for medical image categorization and retrieval
The richness of health-information available on-line requires the development of efficient information retrieval methods. The CISMeF health-catalogue provides indexing and searching capabilities for health resources. Medical images represent a significant part of on-line medical knowledge and are a valuable component of diagnosis and teaching. In this context, a combined text and image extract...
An Improved Algorithm of Bayesian Text Categorization
Text categorization is a fundamental text mining method and has been an active research topic in data mining and web mining in recent years. It plays an important role in building traditional information retrieval, web indexing architecture, Web information retrieval, and so on. This paper presents an improved text categorization algorithm that combines the feature weighting technique with...